I. Introduction

The increasing population of information centers and computer systems
with remote terminals provides the necessary ingredients for the
development of real-time information systems. Such a system has many
advantages over conventional systems.

The most important advantage is obviously the time factor. While
saving of time is of importance to all users, in many cases it spells the
difference between success and failure of a mission or induces the adoption
of less attractive alternatives. Another major advantage of real-time
remote-access systems is better utilization and sharing of resources, and
hence a reduction of the high cost of maintaining local information stores.
Library networks are good examples of this.

Conventional document retrieval systems are batch processed. The
response time runs into hours or even days. These systems are mostly tape
oriented with human interface. Real-time on-line systems are not only fast
but also can be organized to improve the efficiency of the system. The
response time should be in seconds or at most minutes.

Current document retrieval systems are often run in off hours in
batch. This would be undesirable for on-line systems. To attain the
economical objective the on-line system could either be part of a
time-sharing system or have a large number of terminals, and therefore
users. Either solution would put severe restrictions on the response time
of the system. The purpose of this paper is to evaluate the performance,
particularly response time, of a system totally devoted to the purpose of
document retrieval. Two organizations and search strategies are evaluated.
We also present a method of processing which appears to give for linear
files the best response time with a given number of terminals. This new
method utilizes a two level search organization.

The organization of this paper is as follows. Section II reviews basic
concepts in file organization and analyzes the computational algorithms
used in retrieval. Several basic search routines are defined there and
shown to cover all the computational tasks. Section III is devoted to the
discussion of queuing in terms of the processing time. Sections IV and V
are devoted to the analysis of the response time of inverted files and
linear files as embedded in their respective queuing environments.

II. File Organization and Basic Processing Techniques

The  most  common  document  retrieval  system  makes  use  of  coordinate 
indexing  and  searching  on  linear  files.  Each  document  is  assigned  a  given 
number  of  index  terms,  usually  not  more  than  fifteen  per  document.  On  the 
linear  file  each  record  contains  information  of  a  document  and  its  index  terms. 

The queries are formed by choosing a set of index terms. A simple
query would require matching of a subset of the terms. This type of
operation usually results in an abundance of irrelevant output documents.
A better formulation would be to assign as query a Boolean function of the
terms, e.g.,

    $(A \lor B \lor C) \land (D \lor E \lor F) \land (G \lor H)$.

Although the linear file remains the same, processing of a Boolean query
would take more time and hopefully yield more satisfactory results. The
Boolean query and linear file are the basis of many operational large scale
document retrieval systems. Linear files are suited for batch processing
and conveniently implemented on magnetic tapes. Their organization,
however, hinders response time as the same processing routine is executed
whether one query or a hundred queries are presented. In fact, up to a
certain point when the system ceases to be input-output limited the
response time is essentially constant.

Each record in an inverted file consists of a single index term with
the location numbers of those documents which use this term in their index.
When a query is submitted, the inverted file portions of those terms given
in the query are retrieved. Further processing based on the linear or
Boolean query is done on these inverted file portions to extract the
accession numbers of those documents which meet the requirements. Although
the processing of a single query may be reasonably fast, the system is
loaded down quickly when the demand increases. A detailed analysis of this
will be given in Section IV.

When the number of terminals in the system is small the processing
time is negligible as compared with the time needed for search in the
files. However, in this mode of operation the system's processing
capability is not well utilized. When the number of terminals increases
the processing time increases rapidly while more records are retrieved from
the file. This results in a more favorable situation for the search
operation as more overlaps in mechanical motion are possible. Hence the
bottleneck in retrieval gradually shifts toward the processor. To
accurately analyze the response time it is, therefore, necessary to get a
reasonably accurate estimate of the processing time of the search
algorithms. What follows is a detailed discussion of the various
algorithms and their estimated processing time. Although no attempt is
made to show their optimality, these algorithms are believed to be an
accurate indicator of the computational complexity of their respective
search missions.

A basic search algorithm relates to matching of terms between two
sequences of location numbers. Whenever a match is observed a count is
made. The document location number is stored whenever the count exceeds a
fixed threshold. This algorithm is also useful for initial screening of
documents in a two level linear file Boolean search. In this case the
threshold is set at one. This threshold is implicitly assumed in the
description of the BASIC SEARCH ALGORITHM below.

BASIC SEARCH ALGORITHM

Given:     a_1, a_2, ..., a_x
           b_1, b_2, ..., b_y

Initiate:  i = x;  j = y;

*:  a_i - b_j = 0;  PRINT;  EXIT;

    a_i - b_j < 0;
        j = j - 1;
        j = 0;  EXIT;
        j ≠ 0;  GO TO *;

    a_i - b_j > 0;
        i = i - 1;
        i = 0;  EXIT;
        i ≠ 0;  GO TO *;
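
The listing above can be sketched in Python roughly as follows. This is only
an illustration under stated assumptions: both sequences are taken to be in
ascending order, the threshold is one (the first match ends the search), and
the function name `basic_search` and the sample term numbers are hypothetical.

```python
def basic_search(a, b):
    """Scan two ascending lists from the top, as in the listing.

    Returns (first match or None, number of loops executed).
    """
    i, j = len(a), len(b)            # 1-based indices, as in the pseudocode
    loops = 0
    while i > 0 and j > 0:
        loops += 1
        diff = a[i - 1] - b[j - 1]
        if diff == 0:                # match: PRINT; EXIT
            return a[i - 1], loops
        if diff < 0:                 # b_j is too large: step down in b
            j -= 1
        else:                        # a_i is too large: step down in a
            i -= 1
    return None, loops               # one sequence exhausted without a match


# Hypothetical term numbers for one document and one query.
doc_terms = [3, 8, 15, 21]
query_terms = [5, 15, 40]
print(basic_search(doc_terms, query_terms))
```

The loop counter makes the bound discussed next visible: it can never exceed
the sum of the two list lengths.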


An examination of this typical search algorithm reveals that the number
of loops it goes through is bounded by x+y. As the program stops whenever
i or j equals zero some saving is possible. We show in the appendix that
when this saving is taken into account, the average number of loops
$\theta$ is given by

    $\theta = x \frac{y}{y+1} + y \frac{x}{x+1}$

We shall then estimate the computation time of the BASIC SEARCH ALGORITHM
to be

    $t_1 = 4\theta u$    (1)

where 4 is the length of the loop, and u is the average time for processing
one instruction.

BASIC BOOLEAN SEARCH ALGORITHM FOR LINEAR FILES

Given:     A_ij,  i = 1, 2, ..., i_0;  j = 1, 2, ..., j(i)

Initiate:  i = i_0;  j = j(i_0);

SEARCH:  SEARCH A_ij;
         YES;  GO TO *;
         NO;   j = j - 1;
               j = 0;  EXIT;
               j ≠ 0;  GO TO SEARCH;

*:       i = i - 1;
         i = 0;  PRINT;  EXIT;
         i ≠ 0;  j = j(i);  GO TO SEARCH;

The average length of the loop is five plus the processing of
SEARCH A_ij, which is a special case of the BASIC SEARCH ALGORITHM. It is
estimated that SEARCH A_ij shall take a time of at most (d+1)4u where d is
the average number of terms per document. Hence the processing time for
the BASIC BOOLEAN SEARCH ALGORITHM is

    $t_2 = h[5u + (d+1)4u] \approx 4hu[d+2]$    (2)

where h is the average number of terms in a request.
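
As a rough illustration of what the Boolean search decides for one document,
the sketch below treats a request as an AND of OR-groups, like the sample
query $(A \lor B \lor C) \land (D \lor E \lor F) \land (G \lor H)$, and accepts a
document only if every group has at least one term in the document. It uses a
Python set lookup in place of SEARCH A_ij, so it shows the logic and not the
instruction counts; the names `boolean_match` and `t2_estimate` are
hypothetical, and the value of u is an assumed example.

```python
def boolean_match(groups, doc_terms):
    """True if every OR-group has at least one term present in the document."""
    doc = set(doc_terms)
    return all(any(term in doc for term in group) for group in groups)


def t2_estimate(h, d, u):
    """Processing-time estimate of Eq. (2): t_2 = 4hu[d+2]."""
    return 4 * h * u * (d + 2)


query = [["A", "B", "C"], ["D", "E", "F"], ["G", "H"]]
print(boolean_match(query, ["B", "F", "H", "Z"]))   # every group is hit
print(boolean_match(query, ["B", "F", "Z"]))        # third group is missed
print(t2_estimate(h=10, d=15, u=0.5e-6))            # u assumed to be in seconds
```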

Similarly we may construct two basic search algorithms for the
inverted file:

BASIC BOOLEAN "OR" ALGORITHM FOR INVERTED FILES

Given:     a_0 = b_0 = 0

Initiate:  i = x;  j = y;

*:  a_i - b_j = 0;  PRINT a_i;
        i = i - 1;
        i ≠ 0;  GO TO *;
        i = 0;  j ≠ 0;  GO TO *;
        j = 0;  EXIT;

    a_i - b_j < 0;  PRINT b_j;
        j = j - 1;
        j ≠ 0;  GO TO *;
        j = 0;  i ≠ 0;  GO TO *;
        i = 0;  EXIT;

    a_i - b_j > 0;  PRINT a_i;
        i = i - 1;
        i ≠ 0;  GO TO *;
        i = 0;  j ≠ 0;  GO TO *;
        j = 0;  EXIT;
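
A minimal Python sketch of the "OR" merge is given here for orientation. It
assumes both location lists are ascending and free of duplicates, scans from
the top as the listing does, and, as a deliberate simplification that departs
from a literal reading of the listing, steps both indices on a match so that
a shared location is output once. The name `merge_or` is hypothetical.

```python
def merge_or(a, b):
    """Union of two ascending lists, produced in descending order."""
    i, j = len(a) - 1, len(b) - 1
    out = []
    while i >= 0 or j >= 0:
        if j < 0 or (i >= 0 and a[i] > b[j]):
            out.append(a[i])         # a_i is the larger: PRINT a_i
            i -= 1
        elif i < 0 or a[i] < b[j]:
            out.append(b[j])         # b_j is the larger: PRINT b_j
            j -= 1
        else:
            out.append(a[i])         # equal: output once, step both lists
            i -= 1
            j -= 1
    return out


print(merge_or([2, 5, 9], [5, 7]))
```

Every location in either list is visited once, which is why the loop count
grows with the sum of the two list lengths.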

BASIC BOOLEAN "AND" SEARCH ALGORITHM FOR INVERTED FILES

Given:     a_1, a_2, ..., a_x
           b_1, b_2, ..., b_y

Initiate:  i = x;  j = y;

*:  a_i - b_j = 0;  PRINT a_i;
        i = i - 1;
        i = 0;  EXIT;
        i ≠ 0;  GO TO *;

    a_i - b_j < 0;
        j = j - 1;
        j = 0;  EXIT;
        j ≠ 0;  GO TO *;

    a_i - b_j > 0;
        i = i - 1;
        i = 0;  EXIT;
        i ≠ 0;  GO TO *;
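
The "AND" listing corresponds to a list intersection. The sketch below
follows the listing step for step under the assumption that both lists are
ascending and contain no repeated location numbers; `merge_and` is a
hypothetical name.

```python
def merge_and(a, b):
    """Intersection of two ascending lists, produced in descending order."""
    i, j = len(a) - 1, len(b) - 1
    out = []
    while i >= 0 and j >= 0:
        if a[i] == b[j]:
            out.append(a[i])         # PRINT a_i, then step down in a
            i -= 1
        elif a[i] < b[j]:
            j -= 1                   # b_j cannot match anything left in a
        else:
            i -= 1                   # a_i cannot match anything left in b
    return out


print(merge_and([2, 5, 9, 12], [5, 7, 12]))
```

Unlike the "OR" case the scan stops as soon as either list is exhausted,
which is consistent with the shorter loop assumed for "AND" in the estimates
that follow.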

The processing time of the BASIC BOOLEAN "OR" ALGORITHM can be
estimated as

    $t_3 = (2f)(6u) = 12fu$    (3)

where f is the average number of location numbers per index term and the
average length of the loop is six. Similarly the processing time of the
BASIC BOOLEAN "AND" ALGORITHM can be estimated as

    $t_4 = (2f)(4u) = 8fu$    (4)


III.  Queuing 

The basic system structure considered here consists of a computer
with m remote terminals. In a terminal-machine interaction, a user submits
a request in the form of a Boolean function of index terms and after a
waiting period the machine displays at time $T_F$ the first of a list of
documents satisfying the request and at time $T_L$ the last. We will refer
to $T_F$ and $T_L$ as the first and last response times. $ET_F$ and $ET_L$
are the expected values of $T_F$ and $T_L$. Initially we will assume that
requests are processed one at a time in order of arrival. When a request
is made it is processed immediately unless the computer is busy. In this
case the request joins a queue. The average search time $\alpha$ is the
time from when a request leaves the queue and begins to be processed until
the last requested document is displayed. The average search time
obviously depends on detailed system characteristics and file organization.
It, as well as $ET_F$, will be dealt with in Sections IV and V. The
purpose of this section is to make some rather general statements about
$ET_L$ for a given $\alpha$.

Suppose that the m terminals operate independently and that the
average terminal use time, that is, the time from when a terminal receives
the last document satisfying its request until it submits another, is
$\beta$. The following heuristic argument relates $ET_L$, $\alpha$ and
$\beta$. Let W be the average time a request spends in queue. Then the
probability that at an arbitrary time a terminal is waiting for a response
is

    $\frac{\alpha + W}{\alpha + \beta + W}$.

The number of terminals in this condition is

    $\frac{\alpha + W}{\alpha + \beta + W} m$.

Thus when a request arrives it has to wait a time

    $W = \frac{\alpha + W}{\alpha + \beta + W} m \alpha$.    (5)

Solving for W we find that when both m and $\alpha m / \beta$ are large
enough

    $W \approx (m - 1)\alpha - \beta$    (6)

so that

    $ET_L \approx m\alpha - \beta$    (7)
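
To see how close the large-m approximation is, the heuristic relation
$W = \frac{\alpha + W}{\alpha + \beta + W} m \alpha$ can be solved exactly as a
quadratic in W and compared with $(m-1)\alpha - \beta$. The sketch below does
this; the function names and the sample values (a search time of 0.5 sec, a
terminal use time of 20 sec, 100 terminals) are assumptions chosen only for
illustration.

```python
import math


def queue_wait_exact(m, alpha, beta):
    """Positive root of W^2 + (alpha + beta - m*alpha) W - m*alpha^2 = 0."""
    b = alpha + beta - m * alpha
    return (-b + math.sqrt(b * b + 4 * m * alpha * alpha)) / 2


def queue_wait_approx(m, alpha, beta):
    """Large-m approximation: W ~ (m - 1) alpha - beta."""
    return (m - 1) * alpha - beta


m, alpha, beta = 100, 0.5, 20.0      # hypothetical values, in seconds
w_exact = queue_wait_exact(m, alpha, beta)
w_approx = queue_wait_approx(m, alpha, beta)
print(round(w_exact, 2), round(w_approx, 2))
# The expected last response time adds one search time to the wait.
print(round(w_exact + alpha, 2), round(m * alpha - beta, 2))
```

For these values the two estimates of W differ by less than two search
times, which is small next to the response time itself.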

A rigorous treatment of this problem has been given by Takacs [1]
who has shown that if the terminal use time has the exponential
distribution with mean $\beta$, then

    $ET_L = m\alpha - \beta(1 - P_{m-1})$    (8)

where the factor $1 - P_{m-1}$ is the probability that at the end of a
search the queue is not empty. If the minimum possible value of the search
time is z, where of course $z \le \alpha$, then

    $P_{m-1} \le$ [expression in m, z and $\beta$; not legible in the source]    (9)

As Eq.(9) shows, $P_{m-1}$ converges to zero as the number of terminals m
increases. For the values of m, z and $\beta$ in our applications below,
we can verify that $P_{m-1} \approx 0$. Thus we can use Eq.(7) as the
relation between $ET_L$, $\alpha$ and $\beta$.

When the system has many terminals, the final response time is much longer
than the search time. This fact is due to queuing. Let M be the number of
requests waiting for service at an arbitrary time. Then from Eq.(7) we may
conclude that

    $EM = (m-1) - \beta/\alpha$    (10)

which shows that long queues are common whenever m is large compared with
$\beta/\alpha$.

The fact that many requests are waiting with one at a time service
leads to a consideration of batch service. Let $\alpha_k$ be the average
search time when requests are serviced in batches of k. It is reasonable
to expect that with batching we can organize the search efficiently and
find that the average time to service a batch of requests is less than the
time to service k requests one at a time. Supposing for now that this is
true, that is, that $\alpha_k < k\alpha$, we must ask if batching provides
any improvement in response time. To answer this question we consider a
model like the one at a time model except that now requests are serviced k
at a time in an average time $\alpha_k$. If at the end of a search only p
requests are in queue, the system waits until k-p additional requests
arrive and then resumes operation. Arguing as before we find that when
$P_{m-1} \approx 0$ we have

    $ET_L = \frac{m \alpha_k}{k} - \beta$.    (11)

This shows that batch service is better than one at a time service if
$\alpha_k$ increases less than linearly with k. In the next two sections
we examine the behavior of $\alpha_k$ as a function of k for two types of
retrieval.

IV.  Inverted  File 

The  inverted  file  search  consists  of  four  basic  operations. 

1.  Determine  the  locations  of  the  inverted  file  records  corresponding  to 
all  terms  appearing  in  a  given  Boolean  request. 

2.  Access  the  records. 

3.  Compare  the  records  and  identify  the  document  locations  satisfying  the 
search  request. 

4.  Access  and  display  the  bibliographic  data  at  all  locations  found  in 
part  3. 

As can be seen the search has two processing and file reading
phases. We will assume that either a random addressing or a core-based
indexing scheme is used to relate terms to inverted file records. When
this is done the time required for the first processing phase, part 1, is
negligible compared with that for the rest of the search. The second
processing phase, part 3, must be examined in more detail. Let $\gamma$ be
the average processing time for a single request. As before let h be the
average number of terms per request and d the average number of terms per
document. A request is a Boolean function of its terms. During
processing "AND" and "OR" operations are performed on the inverted file
records corresponding to the terms in a request. Suppose a request has
twice as many "ORs" as "ANDs". We will estimate the time for processing
all the "ORs" and one "AND". The remaining "ANDs" should involve smaller
lists. Using the algorithm of Section II the time to do the "ANDs" is
$\frac{2}{3}h(8du) = \frac{16}{3}hdu$.

After an "AND" operation the combined list will have roughly 2d
entries so an "OR" operation will take a time of about 2d(8u) = 16du.
Neglecting the time for "ANDs" with the remaining lists we can estimate
the average processing time for our request to be

    $\gamma = 16du[1 + h/3]$    (12)

The remaining search time consists of reading the inverted and
document files. We will limit the discussion to the case where the two
files are on separate disc units. Each unit has several disc modules and
each module has an independent read mechanism. We will also suppose that
the two disc units are on separate channels. During the search, a file
reading period begins either at step 2 or at step 4. In either case the
disc controller is presented with a matrix of record locations
$\{l_{ij}\}$ where $l_{ij}$ is the location of the $i$th record to be read
from the $j$th module. For simplicity we will assume that the same number
n of records are read from each module. If the n locations are
distributed randomly on a module then the probability that the distance
from one disc edge to the first location or between adjacent locations is
greater than a is given by

    $P_n(a) = \frac{M-a}{M} \cdot \frac{(M-a)u-1}{Mu-1} \cdots \frac{(M-a)u-n+1}{Mu-n+1}$    (13)

where M is the number of tracks per disc and u is the number of records per
cylinder. From $P_n(a)$ and the seek time characteristic for a particular
disc unit we can compute the average seek time between locations. When n
is small compared with M and u the result is essentially the same as the
average distance between n points randomly spaced on an interval [0, M].
This average is $\frac{M}{n+1}$. Rather than treat the case where the
locations are randomly distributed we will assume that the n locations for
a particular module are evenly spaced $\frac{M}{n+1}$ tracks apart. Thus
the distance between the first track and the first location or between any
two adjacent locations is $\frac{M}{n+1}$.

Now suppose there are S disc modules and that nS records are to be read.
The following procedure is used. Each read arm starts from one edge of
its file and sweeps across to the opposite edge, stopping for record seek
and read operations every $\frac{M}{n+1}$ tracks. Let T(k) be the time for
an arm to move k tracks. The S read arms move simultaneously to the first
S locations in time $T(\frac{M}{n+1})$. Now this is followed by a waiting
period while the correct records rotate into position for reading. For
the inverted file we expect the records to be long and perhaps occupy a
whole track. If we put several markers around each track then we can
begin reading when the first marker appears. The record can be assembled
into its correct order in the processor from a knowledge of the markers.
If we have enough markers then the waiting time to the first one is
negligible and the rotational delay is the same as the read time, which is
the disc rotation time $R_0$. Thus after the access arms move to the first
S locations, one arm, say A, reads in time $R_0$ while the others wait for
the channel. After reading, arm A moves to the next location and the next
arm begins to read. The process continues in this way. If
$T(\frac{M}{n+1}) \ge (S-1)R_0$, then when arm A finishes its second seek
operation, the channel will be free. The same applies to the other arms.
In this case the average time to read all the inverted file locations is

    $T_n = nT(\frac{M}{n+1}) + (S-1+n)R_0$.    (14)

Now if $T(\frac{M}{n+1}) < (S-1)R_0$ then arm A finishes its second seek
operation before the other arms have finished reading. In this case all
but the first of the track seek times are overlapped by rotational delays
and

    $T_n = T(\frac{M}{n+1}) + nSR_0$.    (15)

These two cases can be written together as

    $T_n = T(\frac{M}{n+1}) + nSR_0 + (n-1)[T(\frac{M}{n+1}) - (S-1)R_0]^+$    (16)

where $[x]^+ = 0$ if $x < 0$ and $[x]^+ = x$ if $x \ge 0$.

The first term is the initial track seek time. The second term is the
total rotational delay. For an inverted file with the marking scheme
outlined above $nSR_0$ is the total read time. The third term is the
remaining track seek time after accounting for the overlapping of track
seeks with rotational delays.

The situation is essentially the same with the document file. Here
we would expect to have several records per track. The average rotational
delay is $(\frac{1}{2} + f)R_0 = R'$ where f is the fraction of a track
occupied by a record. The average time to read all the document file
locations is given by Eq.(16) with $R'$ in place of $R_0$. The rotational
delay in document file reading could be decreased by using sector
addressing as suggested by Wang and Ghosh [3]. However as we will see in
the example below, it is the reading time for the inverted file and not the
document file that is the real limitation in search time reduction.

The dependence of Eq.(16) on the batch size k can be made explicit
by setting kD = nS where for the inverted file the factor D is the average
number of terms per request and for the document file D is the average
number of documents satisfying a request. The average lookup time for a
batch of k, L(k), is then

    $L(k) = T(\frac{MS}{kD+S}) + kDR + (\frac{kD}{S} - 1)[T(\frac{MS}{kD+S}) - (S-1)R]^+$    (17)

where R is $R_0$ or $(\frac{1}{2}+f)R_0$ depending on the file. The
utility of batching is in the fact that only the second term of L(k)
increases linearly with k. The other two terms decrease as k increases.
With batching the part of file lookup time due to track seeks can be made
negligible compared with the rotational delay time.
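
The effect of batch size on lookup time can be explored numerically. The
sketch below evaluates the lookup time for a batch of k requests as an
initial seek, plus kDR of rotational delay, plus the part of the later seeks
that is not hidden behind the other arms' reading, with n = kD/S records per
module. The seek-time function `linear_seek` is a hypothetical straight-line
model with made-up coefficients, not a published characteristic, and all
function names are assumptions.

```python
def positive_part(x):
    """[x]^+ : x when positive, otherwise zero."""
    return x if x > 0 else 0.0


def lookup_time(k, D, S, M, R, seek):
    """Lookup time for a batch of k requests (all times in ms)."""
    n = k * D / S                      # records read from each module
    t = seek(M / (n + 1))              # seek between evenly spaced locations
    return t + k * D * R + (n - 1) * positive_part(t - (S - 1) * R)


def linear_seek(tracks):
    """Hypothetical seek-time model in ms; coefficients are illustrative."""
    return 30.0 + 1.5 * tracks


for k in (2, 4, 8):
    total = lookup_time(k, D=8, S=2, M=200, R=25.0, seek=linear_seek)
    print(k, round(total / k, 1))      # lookup time per request
```

With these assumed parameters the lookup time per request falls as k grows,
because the seek distance shrinks while the rotational part stays
proportional to k.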

When the inverted and document files are on separate channels parts
of the total search can be overlapped. Parts 1-3 of the search can be done
for one batch while part 4 is being done for the other batch. Let $L_I(k)$
and $L_D(k)$ be the inverted and document file look-up times. Then the
time for parts 1-3 of the search is

    $L_I(k) + k\gamma$    (18)

where $\gamma$ is the average processing time estimated above. This figure
is conservative since some processing can begin before all the records are
accessed. The average time for an overlapped search is then

    $\alpha_k = L_D(k) + [L_I(k) + k\gamma - L_D(k)]^+$    (19)

since parts 1-3 of a search must be completed before part 4 can begin.

In a strict sense we cannot use $\alpha_k$ in the queuing Eq.(11). This
is because there may be times when the queue is empty and there is no batch
to overlap with the batch currently being serviced. However when
$P_{m-1} \approx 0$ the fraction of time that the queue is empty is
negligible and we will use Eq.(19) to describe the overlapped search. We
have then that

    $ET_L = \frac{m}{k}\{L_D(k) + [L_I(k) + k\gamma - L_D(k)]^+\} - \beta$    (20)

Rather than compute $ET_F$ directly we can notice that it is lower bounded
by the queuing time and upper bounded by $ET_L$ so that

    $ET_L - \alpha_k \le ET_F \le ET_L$.    (21)

These bounds are tight when m is large.

Example. Consider a collection of $8 \cdot 10^5$ documents with $10^4$
terms and an average of 15 terms per document. This gives an average of
1200 documents per term. We will suppose that the average number of terms
per request is 8 and the average number of documents satisfying a request
is 16.

The inverted and document files are stored on two separate IBM 2314
disc units. These units have 8 modules each. A module has M = 200
cylinders with a rotational time $R_0$ = 25 ms. Allowing 150 bytes for a
document file entry and 3600 bytes for an inverted file entry the document
file will fit on 8 modules with 25 records per track and the inverted file
on 2 modules with one record per track. The inverted file is not packed
densely since random addressing is used.

Using these numbers we have for the inverted file

    $L_I(k) = T(\frac{200}{4k+1}) + 200k + (4k-1)[T(\frac{200}{4k+1}) - 25]^+$    (22)

and for the document file

    $L_D(k) = T(\frac{200}{2k+1}) + 216k + (2k-1)[T(\frac{200}{2k+1}) - 94.5]^+$    (23)

From the published table for 2314 seek times we can find
$T(\frac{200}{4k+1})$ and $T(\frac{200}{2k+1})$. It turns out that the
$[\ ]^+$ term in Eq.(23) is zero for $k \ge 1$. Using the linear
approximation for the table we find that

    $L_I(1) = 397$
    $L_I(k) = 220k + \frac{1200k}{4k+1} + 25$,    $k \ge 2$    (24)

and

    $L_D(k) = 62 + \frac{30}{2k+1} + 216k$,    $1 \le k \le 3$
    $L_D(k) = 30 + \frac{300}{2k+1} + 216k$,    $k \ge 4$    (25)


From Eq.(12) we can compute the average processing time $\gamma$ = 94 ms.
So using Eq.(19) we have for the overlapped search time

    $\alpha_k = L_I(k) + k\gamma$.    (26)

For this example at least $\alpha_k$ does not depend on $L_D(k)$ since
$L_I(k) + k\gamma > L_D(k)$. To see the effects of batching on the response
time it is of more interest to compute $\alpha_k/k$. We have

    $\alpha_k/k = 491$,    $k = 1$
    $\alpha_k/k = 314 + \frac{1200}{4k+1} + \frac{25}{k}$,    $k \ge 2$    (27)

As k becomes large $\alpha_k/k$ approaches 314 which is just the sum of the
rotational delay and processing time for parts 1-3 of the search. Some
values of $\alpha_k/k$ appear in the table.


    k                  1    2    3    4    5    6    7    8    9   10
    $\alpha_k/k$ (ms) 491  460  415  391  376  366  359  353  349  346
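
The tabulated values can be checked with a few lines of Python. The sketch
assumes the closed form read from Eq.(27), that is 491 ms for k = 1 and
$314 + \frac{1200}{4k+1} + \frac{25}{k}$ ms for larger k, and then applies the
batch queuing relation $ET_L = m\alpha_k/k - \beta$ with $\beta$ = 20 sec. The
function names are hypothetical.

```python
def alpha_per_request(k):
    """Search time per request, alpha_k / k, in ms (form assumed from Eq. 27)."""
    if k == 1:
        return 491.0
    return 314.0 + 1200.0 / (4 * k + 1) + 25.0 / k


def expected_last_response(m, k, beta_ms):
    """ET_L = m * alpha_k / k - beta, in ms."""
    return m * alpha_per_request(k) - beta_ms


table = [round(alpha_per_request(k)) for k in range(1, 11)]
print(table)

# Expected last response time in seconds for two terminal counts.
for m in (100, 500):
    print(m, [round(expected_last_response(m, k, 20000.0) / 1000.0, 1)
              for k in (1, 5, 10)])
```

The rounded values agree with the table row by row, and the second loop
shows how quickly the response time scales with the number of terminals.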

Using these values of $\alpha_k/k$ we can compute $ET_L$ and plot it as a
function of k. The graph shows $ET_L$ curves for [illegible], 500 and 100
terminals with $\beta$ = 20 sec.

V.  Linear  Files  and  Round  Robin  Strategy 

Linear  files  have  been  used  extensively  in  a  batch  mode  for  several 
reasons.  The  serial  nature  of  the  records  makes  it  economical  for  sequential 
processing  which  minimizes  seek  time.  As  random  access  is  not  required,  it  is 
very  attractive  for  systems  with  magnetic  tapes  only. 


To evaluate the potential of linear files in an on-line environment,
let us first calculate $\alpha_k$, the time required for processing all
documents against k queries. We will find that file data can be read from
tapes into core faster than it can be processed. Consequently we will
assume that the file portion to be examined is always in core when it is
needed. This being the case it is seen that

    $\alpha_k = N s_k$    (28)

where N is the number of documents in the entire file, and $s_k$ is the
time required for processing k requests at once against the terms of one
document. With a two level search, the first step is to compare all the kh
terms of the k queries with the d terms of a document. If there are any
matches an exact comparison is made at the second level. From
considerations in Section II it is seen that

    $s_k = 4\theta_k u + p_k t_2$    (29)

where

    $\theta_k = kh \frac{d}{d+1} + d \frac{kh}{kh+1}$,    (30)

$p_k$ is the average number of queries passing the first level search, and
$t_2$ is the processing time of a single document with one query. The
value of $t_2$ has been estimated in Eq.(2) as

    $t_2 = 4hu[d+2]$    (31)

Since $p_k$ is usually a small number

    $s_k \approx 4\theta_k u$    (32)


From Eq.(11) we see that for batches of k

    $ET_L = \frac{m\alpha_k}{k} - \beta = \frac{4mNu\theta_k}{k} - \beta = 4mNu[h \frac{d}{d+1} + d \frac{h}{kh+1}] - \beta$    (33)

From these formulas it is clear that conventional systems based on the
linear file give a long last response time, although batch processing does
improve the performance somewhat.

However the documents satisfying a search request can be read out
as soon as they are identified. For a particular request the average time
to the first document is $\frac{\alpha_k}{d}$ so we have

    $ET_F = ET_L - \alpha_k + \frac{\alpha_k}{d} = (\frac{m}{k} - 1 + \frac{1}{d})\alpha_k - \beta$    (34)


As noted earlier we cannot use Eq.(11) for large values of k. However for
large k and hence large $\alpha_k$ we can argue that $P_{m-1}$ should be
small and Eq.(11) holds for $P_{m-1} = 0$. This fact is used to obtain
Eq.(33). The best value of k in Eq.(33) is k = m which gives

    $ET_L = 4Nu[mh \frac{d}{d+1} + d \frac{mh}{mh+1}] - \beta \approx 4Numh - \beta$    (35)

Our conclusion is that processing should be done in such a way that new
queries enter into processing as soon as possible and do not queue for
machine time, that is, the batch size is m. Increasing the batch size
makes an almost negligible decrease in $ET_L$ but does improve $ET_F$.

Example. Consider a system with m = 100, d = 15, h = 10,
$N = 8 \cdot 10^5$, $u = 0.5 \cdot 10^{-6}$ and then we have for k = m,

    $ET_L$ = 25 min
    $ET_F$ = 1.7 min

The graph of Fig. 2 shows $ET_L$ and $ET_F$ as a function of batch size.
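
The figure of about 25 minutes can be reproduced with a short calculation.
The sketch below evaluates the linear file relations
$\theta_k = kh\frac{d}{d+1} + d\frac{kh}{kh+1}$, $\alpha_k \approx 4Nu\theta_k$ and
$ET_L = m\alpha_k/k - \beta$. Two things are assumptions on our part: u is
taken to be in seconds, and the terminal use time $\beta$ is set to zero
because the example does not state it. The function names are hypothetical.

```python
def theta_k(k, h, d):
    """Average loops to match kh query terms against d document terms."""
    kh = k * h
    return kh * d / (d + 1) + d * kh / (kh + 1)


def batch_search_time(k, N, u, h, d):
    """alpha_k ~ N * 4 * theta_k * u, neglecting the second-level search."""
    return 4 * N * u * theta_k(k, h, d)


def last_response_time(m, k, N, u, h, d, beta=0.0):
    """ET_L = m * alpha_k / k - beta, in the time unit of u."""
    return m * batch_search_time(k, N, u, h, d) / k - beta


m, d, h, N, u = 100, 15, 10, 8e5, 0.5e-6     # u assumed to be in seconds
for k in (1, 10, 100):
    minutes = last_response_time(m, k, N, u, h, d) / 60.0
    print(k, round(minutes, 1))
```

Under these assumptions k = m gives a little over 25 minutes, and one at a
time service (k = 1) is more than twice as slow.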

VI.  Concluding  Remarks 

The response time for on-line document retrieval systems has been
investigated. It is shown that for systems with a large number of
terminals the response time is approximately linear with m, the number of
terminals. Two file organizations have been evaluated. It is found that
if traffic is not too heavy the inverted file seems adequate. The linear
file is rather slow in comparison. The general conclusion here is that
conventional techniques for document retrieval are not adequate for on-line
systems when the number of terminals is very large. For such systems to be
functional one needs to develop new and original file organization and
search techniques.


We wish to note that when a parallel processor such as the ILLIAC IV
is available the efficiency of the system can improve by a factor at least
equal to the number of PU's available. For sixty-four parallel PU's the
system can handle about one hundred times the load. The additional
improvement is obtained because of savings in execution time.

Appendix

Theorem: Suppose we are given two sequences of points
$a_1, a_2, \ldots, a_x$ and $b_1, b_2, \ldots, b_y$ obtained by ordering x
and y points each uniformly distributed between 0 and M. Then if the basic
search algorithm is used to match the sequences the average number of steps
is given by

    $\theta = x \frac{y}{y+1} + y \frac{x}{x+1}$

Proof of Theorem

From the probability that $a_x \le t$ the density function for $a_x$ is
found to be

    $\frac{x}{M} (\frac{t}{M})^{x-1}$

Similarly the density for $b_y$ is

    $\frac{y}{M} (\frac{s}{M})^{y-1}$

The average number of processing steps when t > s is

    $y + (x-1)\frac{s}{t}$

It is

    $x + (y-1)\frac{t}{s}$

when s > t.

The expected number of processing steps is, therefore, given by

    $\theta = \int_0^M \frac{x}{M}(\frac{t}{M})^{x-1} [\int_0^t \frac{y}{M}(\frac{s}{M})^{y-1}(y + (x-1)\frac{s}{t}) ds + \int_t^M \frac{y}{M}(\frac{s}{M})^{y-1}(x + (y-1)\frac{t}{s}) ds] dt$

It is seen that

    $\int_0^t \frac{y}{M}(\frac{s}{M})^{y-1}(y + (x-1)\frac{s}{t}) ds = \frac{y(x+y)}{y+1} \frac{t^y}{M^y}$

Therefore

    $\theta = x(\frac{y}{y+1}) + y(\frac{x}{x+1})$
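
The theorem lends itself to a quick numerical check. The sketch below draws
x and y uniform points, orders them, counts the loops the basic search
executes before one sequence is exhausted, and compares the sample mean with
$x\frac{y}{y+1} + y\frac{x}{x+1}$. It is a Monte Carlo illustration and not
part of the proof; the function names, the fixed seed and the trial count are
assumptions.

```python
import random


def loops_until_exhausted(a, b):
    """Loops of the basic search on ascending lists, scanning from the top."""
    i, j, loops = len(a), len(b), 0
    while i > 0 and j > 0:
        loops += 1
        if a[i - 1] == b[j - 1]:     # a match ends the search (probability
            break                    # zero for continuous uniform points)
        if a[i - 1] < b[j - 1]:
            j -= 1
        else:
            i -= 1
    return loops


def average_loops(x, y, trials, seed=0):
    """Sample mean of the loop count over random uniform sequences."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        a = sorted(rng.random() for _ in range(x))
        b = sorted(rng.random() for _ in range(y))
        total += loops_until_exhausted(a, b)
    return total / trials


def theta(x, y):
    """Average number of loops stated in the theorem."""
    return x * y / (y + 1) + y * x / (x + 1)


print(round(average_loops(5, 8, 20000), 2), round(theta(5, 8), 2))
```

The interval length M does not enter the result, so the simulation uses the
unit interval.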


[Figure: average response time, m = 100]